5 Using Character Prototypes in Text Recognition 4 Extraction of Character Prototypes 3 Recognition of Stop Words

Author

  • YiHong Xu
Abstract

the words containing missing characters to the 43,300-word lexicon.

Table 1: Results of regular expression matching with a 43,300-word lexicon (averaged over 200 pages).

No. of words that                        Average per page   Percentage
are present on a test page                         248.75      100.00%
have missing characters                             42.04       16.90%
have a unique match of same length                  26.74       10.75%
have multiple matches of same length                11.27        4.53%
have no matches of same length                       4.13        1.66%

7 Conclusions

We propose an adaptive text recognition strategy that is initiated by recognizing some common (stop) words. The recognized words are then aligned for extraction of common characters, and then recursively segmented using accumulated character prototypes. Character prototypes obtained from processing the stop words can then be used to train a classifier to recognize the rest of the text.

By focusing on a relatively small set of words initially, more expensive word recognition techniques such as holistic word shape matching can be applied, so that the method can be applied to lower quality images. In several stages, reject criteria are introduced to defend against unreliable recognition and character or word matches. The strategy uses a characteristic mix of whole-word and isolated-character based methods. Other techniques such as string matching and dynamic programming are also found useful in the process.

Moreover, we do not intend to completely abandon multi-font classifiers. Instead, we consider them as a supplementary tool for ambiguities that cannot be solved by the proposed procedure. Those include text in a non-dominant font (such as italics, boldfaces, numbers, or headers) and words with missing characters that cannot be matched uniquely to words in a general-purpose lexicon. The study of the effect of missing characters shows that the use of a multi-font classifier can be reduced to a minimum.
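The regular expression matching summarized in Table 1 can be illustrated with a minimal sketch. It assumes a partially recognized word is given as a list of characters with `None` marking each missing character, and that every missing position stands for exactly one character, so only lexicon words of the same length can match. The function names and the toy lexicon are hypothetical and are not taken from the paper.

```python
import re

def match_candidates(chars, lexicon):
    """Return lexicon words of the same length that agree with the known characters.

    chars: list of single characters, with None at each missing position
           (a hypothetical representation chosen for this sketch).
    """
    # Each missing character becomes '.', which matches exactly one character,
    # so a full match forces the candidate to have the same length.
    pattern = "".join("." if c is None else re.escape(c) for c in chars)
    regex = re.compile(pattern)
    return [word for word in lexicon if regex.fullmatch(word)]

def classify_match(chars, lexicon):
    """Sort a word into the three outcome categories listed in Table 1."""
    matches = match_candidates(chars, lexicon)
    if len(matches) == 1:
        return "unique", matches
    if matches:
        return "multiple", matches
    return "none", matches

# Toy lexicon standing in for the 43,300-word lexicon.
toy_lexicon = ["there", "these", "where", "which", "about"]
print(classify_match(["w", "h", None, "c", "h"], toy_lexicon))
print(classify_match(["t", "h", "e", None, "e"], toy_lexicon))
```

Only words in the "multiple" and "none" categories would need to be passed on to a supplementary classifier; a unique match resolves the missing characters directly.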
Acknowledgements

The author thanks George Nagy for reference [2], and him and YiHong Xu for helpful discussions and sharing of their prior work; Henry Baird for the defect model; John Hobby for trials of his bitmap matching and averaging algorithms; and Dan Lopresti and Raffaele Giancarlo for discussions on string matching.

Figure 3: Two pairs of words aligned for extraction of common characters ("a" in the left pair, "n" in the right pair).

baseline of both words, and a small local search is performed for the best vertical shift. The column-wise scores are summed over the width of the region which is dependent on the estimated character …
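The fragment above mentions a local search for the best vertical shift and column-wise scores summed over a region. A rough sketch of that idea follows. It assumes two equally sized binary bitmaps (lists of rows, 1 = ink) and uses the count of overlapping ink pixels as the column score; the paper's actual scoring function is not given in this excerpt, so both the score and the function names are illustrative assumptions.

```python
def column_scores(a, b, shift):
    """Per-column overlap between bitmap a and bitmap b moved down by `shift` rows.

    a, b: equally sized binary bitmaps as lists of rows (1 = ink).
    The overlap count is an assumed stand-in for the paper's column-wise score.
    """
    rows, cols = len(a), len(a[0])
    scores = [0] * cols
    for r in range(rows):
        rb = r - shift  # row of b that lands on row r of a
        if 0 <= rb < rows:
            for c in range(cols):
                if a[r][c] == 1 and b[rb][c] == 1:
                    scores[c] += 1
    return scores

def best_vertical_shift(a, b, col_start=0, col_end=None, max_shift=2):
    """Try shifts in [-max_shift, max_shift]; return (shift, summed score).

    Column scores are summed over the region [col_start, col_end).
    """
    if col_end is None:
        col_end = len(a[0])
    best = None
    for shift in range(-max_shift, max_shift + 1):
        total = sum(column_scores(a, b, shift)[col_start:col_end])
        if best is None or total > best[1]:
            best = (shift, total)
    return best

# b is the same small shape as a, drawn one row higher.
a = [[0, 0, 0], [1, 1, 1], [1, 0, 1], [0, 0, 0]]
b = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 0, 0]]
print(best_vertical_shift(a, b))
```

Restricting the sum to a column region lets the same routine score a candidate character position inside a longer word image.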


Similar resources

Fast Identification of Stop Words for Font Learning and Keyword Spotting

A recently proposed adaptive strategy for text recognition uses a linguistic fact that over half of the words on a typical English page are among 150 common stop words. The small lexicon permits word-shape based recognition that yields word identities from which character prototypes can be extracted. This paper describes a fast procedure for locating the best candidates for those stop words. Th...


Neural Network Based Recognition System Integrating Feature Extraction and Classification for English Handwritten

Handwriting recognition has been one of the active and challenging research areas in the field of image processing and pattern recognition. It has numerous applications, including reading aids for the blind, bank cheques, and conversion of any handwritten document into structured text form. Neural Network (NN) with its inherent learning ability offers promising solutions for handwritten characte...


Automatic writer identification framework for online handwritten documents using character prototypes

This paper proposes an automatic text-independent writer identification framework that integrates an industrial handwriting recognition system, which is used to perform an automatic segmentation of an online handwritten document at the character level. Subsequently, a fuzzy c-means approach is adopted to estimate statistical distributions of character prototypes on an alphabet basis. These dist...


Handwritten Character Recognition using Modified Gradient Descent Technique of Neural Networks and Representation of Conjugate Descent for Training Patterns

The purpose of this study is to analyze the performance of the backpropagation algorithm with changing training patterns and the second momentum term in feed-forward neural networks. This analysis is conducted on 250 different words of three small letters from the English alphabet. These words are presented to two vertical segmentation programs which are designed in MATLAB and based on portions (1...


Building compact recognizer with recognition rate maintained for on-line handwritten Japanese text recognition

The paper presents complexity reduction of an on-line handwritten Japanese text recognition system by selecting an optimal off-line recognizer in combination with an on-line recognizer, geometric context evaluation, and linguistic context evaluation. The result is that a surprisingly simple off-line recognizer, which is weak on its own, produces nearly the best recognition rate in combination w...




Publication date: 1998